Abstract

This paper provides formulas for expected true-score measures and reliability of binary items as a
function of their Rasch difficulty parameters when the trait distribution is normal or logistic. With 
the proposed formulas, one can evaluate the theoretical values of classical reliability indexes for 
norm-referenced and criterion-referenced interpretations without information about raw scores or
trait scores of persons from the target population. This is achieved by representing the theoretical 
(marginalized) values of the true-score components of reliability indexes as functions of the item 
difficulty parameter. As the analytic forms of such functions are developed for individual items 
(and then "summarized" at test level), one can know the population values of true-score measures 
and reliability for a set of Rasch calibrated binary items prior to their administration. An example 
for the application of the proposed formulas and their empirical validation is also provided. 
Reliability of True Cutting Scores for Rasch Calibrated Items

Despite disadvantages of the True-Score Model (TSM) in metric development and 
accuracy of measurement compared to item-response theory models (e.g., Hambleton and Jones, 
1993) and Rasch models (e.g., Linacre, 1997; Smith, 2000, 2001), the TSM has been and is still 
used in test development and test score analysis. Traditionally, true-score measures represent a 
common focal point between test developers and practitioners as they place the scores and their 
accuracy in the original scale of measurement [e.g., number-right (NR) scores]. True scores
are readily interpretable and, for example, when pass-fail decisions are made, a cutting score is
typically set on the domain-score scale (e.g., Hambleton, Swaminathan, & Rogers, 1991, p. 85).
Recent debates and editorial policies on issues of reliability (e.g., Dimitrov, 2002; Sawilowsky,
2000; Thompson & Vacha-Haase, 2000) also indicate the necessity of adequate understanding
and estimation of TSM reliability and standard error of measurement at sample and population 
level. In this paper, marginal population values of true-score measures for individual binary items 
are determined from their Rasch difficulty parameters. In a previous work (Dimitrov, 2002) this 
has been achieved only for the marginal item score and item error variance. This paper completes 
the work by providing analytic evaluations for item true variance, item reliability, classical test 
reliability, and dependability of cutting scores for criterion-referenced ("pass/fail") decisions. 

The proposed formulas have theoretical value and can be very useful in test development, score 
analysis, and simulation studies. For example, given a bank of binary items calibrated with the 
dichotomous Rasch model (Rasch, 1960), one can select items with known true-score measures 
and reliability prior to administering the test. 

It is important to note that the information provided with the proposed formulas and the 
information obtained through Rasch analysis can complement (not replace or exclude) each other 
in measurement analysis. For example, the TSM reliability evaluated with the method developed 
in this article provides more information about the accuracy of measurement at population level 
relative to classical coefficients such as Cronbach’s alpha (Cronbach, 1951), but it cannot replace 
the information provided by Rasch reliability measures for locating persons on the underlying trait
(e.g., Linacre, 1996, 1997; Wright, 2001). Other comments on this issue that follow later in the 
text also support the argument that researchers and practitioners will benefit from combining 
Rasch measurement information with TSM information about population measures provided by 
formulas developed in this paper. 

Theoretical Framework 

As the title of the paper indicates, the reliability of true cutting scores for Rasch calibrated 
binary items is a primary target of the study. It should be noted, however, that the "intermediate" 
results (formulas) developed in this paper in achieving this task have their own (methodological
and technical) value in Rasch test development and analysis. Therefore, the paper is organized by
(1) presenting true-score measures and reliability for individual binary items as a function of their 
Rasch difficulty parameter, (2) "summarizing" the results for individual items to obtain evaluations 
for true-score measures at test level (e.g., error variance and true variance for the NR score), and 
(3) using the true-score measures at test level to evaluate the theoretical reliability for both norm- 
referenced and criterion-referenced reliability. 

With the dichotomous Rasch model, the probability of a correct answer on item $i$ with
difficulty $\delta_i$ for a person with a trait score $\theta$ is

$$P_i(\theta) = \frac{\exp(\theta - \delta_i)}{1 + \exp(\theta - \delta_i)}. \tag{1}$$

As $P_i(\theta)$ is also the true score on item $i$ for a person at $\theta$, the item score (marginal probability for
correct response on the item) is

$$\pi_i = \int_{-\infty}^{\infty} P_i(\theta)\,\varphi(\theta)\,d\theta, \tag{2}$$

where $\varphi(\theta)$ is the probability density function (pdf) for the population trait distribution. With this,
the marginal NR score for a test of $n$ binary items is then

$$\mu = \sum_{i=1}^{n} \pi_i, \tag{3}$$

and the expected domain score is $\pi = \mu/n$ (in terms of percentages: $100\mu/n$).

Also, $P_i(\theta)[1 - P_i(\theta)]$ is the error variance for a binary item $i$ at $\theta$ (Lord, 1980, p. 52).
Therefore, the (marginal) item error variance is

$$\sigma^2(e_i) = \int_{-\infty}^{\infty} P_i(\theta)[1 - P_i(\theta)]\,\varphi(\theta)\,d\theta. \tag{4}$$

The (marginal) error variance for the NR score on a test of $n$ dichotomous items is then

$$\sigma_e^2 = \sum_{i=1}^{n} \sigma^2(e_i). \tag{5}$$

It is important to emphasize that $\sigma_e^2$ represents the accuracy of number-right scores and is not to
be confused with the mean square measurement error (MSEp) that represents the accuracy of
trait scores on the logit scale with Rasch measurement models (e.g., Smith, 2001). Also, while the
MSEp is a sample statistic that requires information about the person's trait score, $\theta$, $\sigma_e^2$ does not
require such information because it is obtained through integration over the trait interval.

Closed-form integral evaluations for $\pi_i$ and $\sigma^2(e_i)$ in Equations 2 and 4, respectively, are
provided in the next section. The population distribution for the underlying trait is assumed to be
normal or logistic. The pdf of a logistic distribution (e.g., Evans, Hastings, & Peacock, 1993, p.
98) with the location at the origin of the scale is

$$\varphi(\theta) = \frac{\exp(\theta/c)}{c\,[1 + \exp(\theta/c)]^2}, \tag{6}$$
where $c$ is the scale parameter. This paper deals with two specific logistic distributions ($c = 1$ or
$c = 1/2$) that yield exact integral evaluations for Equations 2 and 4 and capture normal-like ability
shapes that may occur in practice with Rasch measurement (see Figure 1).

[Figure 1: plot not reproduced; horizontal axis: ability (in logits), from -4 to 4.]

Figure 1. Probability density functions (PDF) of the standard
normal distribution and two logistic distributions with scale
parameters $c = 1$ and $c = 1/2$.

Formulas at Item Level 

Item Score with the Normal Ability Distribution 

With $P_i(\theta)$ for the dichotomous Rasch model and $\varphi(\theta)$ for $N(0, 1)$, an exact closed-form
evaluation for the integral in Equation 2 does not exist. Therefore, an approximation formula was
developed in two steps. First, using the computer program MATLAB (MathWorks, Inc., 1999),
quadrature method evaluations were obtained for values of the Rasch item difficulty, $\delta_i$, in the
interval from -6 to 6 with an increment of 0.01 on the logit scale. Second, the results were
tabulated and then approximated with the four-parameter sigmoid function using the regression
wizard of the computer program SigmaPlot 5.0 (SPSS Inc., 1998). The resulting approximation
formula (with an absolute error smaller than 0.02) for the expected item mean is

$$\pi_i = -0.0114 + \frac{1.0228}{1 + \exp(\delta_i/1.226)}. \tag{7}$$

Formula 7 can be used with any normal trait distribution, $N(\mu_\theta, \sigma_\theta)$, after transforming the item
difficulty estimate to $(\delta_i - \mu_\theta)/\sigma_\theta$ (e.g., Smith, 2000). For $n$ binary items, the expected
number-right score, $\mu$, is obtained with Equation 3 (the expected domain score is $\pi = \mu/n$).
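
As an illustration only, the following Python sketch compares Formula 7 with a direct numerical evaluation of Equation 2 under $\theta \sim N(0, 1)$. It is not the MATLAB/SigmaPlot procedure described above: the composite Simpson rule, the truncation of the integral to [-10, 10], the coarse grid of difficulties, and all function names are assumptions made for this example.

```python
import math

def rasch_prob(theta, delta):
    """Equation 1: probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(delta - theta))

def normal_pdf(theta):
    """Standard normal density (the trait distribution assumed for Formula 7)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def item_mean_quadrature(delta):
    """Equation 2 evaluated numerically; [-10, 10] is an assumed truncation."""
    return simpson(lambda t: rasch_prob(t, delta) * normal_pdf(t), -10.0, 10.0)

def item_mean_formula7(delta):
    """Formula 7: sigmoid approximation of the marginal item mean."""
    return -0.0114 + 1.0228 / (1.0 + math.exp(delta / 1.226))

# Coarse grid -6.0, -5.5, ..., 6.0 (the paper used a step of 0.01).
deltas = [d / 2.0 for d in range(-12, 13)]
max_gap = max(abs(item_mean_formula7(d) - item_mean_quadrature(d)) for d in deltas)
print(f"largest |Formula 7 - quadrature| on the grid: {max_gap:.4f}")
```

For a trait distribution other than $N(0, 1)$, the difficulty would first be transformed as described above before calling `item_mean_formula7`.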

Item Score with Logistic Ability Distributions 

With $c = 1$, Equation 2 [with $P_i(\theta)$ from Equation 1 and $\varphi(\theta)$ from Equation 6] becomes

$$\pi_i = \int_{-\infty}^{\infty} \frac{\exp(\theta - \delta_i)\exp(\theta)}{[1 + \exp(\theta - \delta_i)][1 + \exp(\theta)]^2}\,d\theta. \tag{8}$$

With the substitution $t = \exp(\theta)$, the integral evaluation in Equation 8 becomes straightforward
and (with simple algebra) leads to an exact formula for the expected mean on individual items:

$$\pi_i = \frac{(\delta_i - 1)\exp(\delta_i) + 1}{[\exp(\delta_i) - 1]^2}. \tag{9}$$

With $c = 1/2$, Equation 2 becomes

$$\pi_i = \int_{-\infty}^{\infty} \frac{2\exp(\theta - \delta_i)\exp(2\theta)}{[1 + \exp(\theta - \delta_i)][1 + \exp(2\theta)]^2}\,d\theta. \tag{10}$$

Again, using the substitution $t = \exp(\theta)$, a straightforward integration leads to an exact formula:

$$\pi_i = \frac{\pi\exp(\delta_i)[\exp(2\delta_i) - 1] - 2(2\delta_i - 1)\exp(2\delta_i) + 2}{2[1 + \exp(2\delta_i)]^2}, \tag{11}$$

where the constant $\pi$ on the right-hand side ($\approx 3.1416$) is not to be confused with the notation for the domain score.
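
The two exact formulas can be checked numerically. The sketch below codes Formulas 9 and 11 and compares them with a Simpson-rule evaluation of Equation 2 under the logistic pdf of Equation 6. The function names, the truncation to [-40, 40], and the use of the limiting value 1/2 for Formula 9 near $\delta_i = 0$ (where the ratio is of the form 0/0) are assumptions of this example rather than part of the original derivation.

```python
import math

def item_mean_logistic_c1(delta):
    """Formula 9 (c = 1); near delta = 0 the limit 1/2 is returned (assumed guard)."""
    if abs(delta) < 1e-4:
        return 0.5
    e = math.exp(delta)
    return ((delta - 1.0) * e + 1.0) / (e - 1.0) ** 2

def item_mean_logistic_c_half(delta):
    """Formula 11 (c = 1/2); math.pi is the numerical constant, not the domain score."""
    e2 = math.exp(2.0 * delta)
    num = math.pi * math.exp(delta) * (e2 - 1.0) - 2.0 * (2.0 * delta - 1.0) * e2 + 2.0
    return num / (2.0 * (1.0 + e2) ** 2)

def logistic_pdf(theta, c):
    """Equation 6, written with exp(-|theta|/c) (the pdf is symmetric) to avoid overflow."""
    z = math.exp(-abs(theta) / c)
    return z / (c * (1.0 + z) ** 2)

def rasch_prob(theta, delta):
    """Equation 1."""
    return 1.0 / (1.0 + math.exp(delta - theta))

def simpson(f, a, b, n=8000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def item_mean_quadrature(delta, c):
    """Equation 2 with the logistic pdf; [-40, 40] is an assumed truncation."""
    return simpson(lambda t: rasch_prob(t, delta) * logistic_pdf(t, c), -40.0, 40.0)

for d in (-2.0, 0.0, 1.0):
    print(d,
          round(item_mean_logistic_c1(d), 6), round(item_mean_quadrature(d, 1.0), 6),
          round(item_mean_logistic_c_half(d), 6), round(item_mean_quadrature(d, 0.5), 6))
```

Because both logistic distributions are symmetric about zero, the item means for $\delta_i$ and $-\delta_i$ should sum to 1, which gives a quick additional check on any implementation.
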
Item Error Variance with the Normal Trait Distribution

With $\varphi(\theta)$ for the standard normal pdf, Equation 4 can be written

$$\sigma^2(e_i) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \frac{\exp(\theta - \delta_i)}{[1 + \exp(\theta - \delta_i)]^2}\,\exp(-\theta^2/2)\,d\theta. \tag{12}$$

As an exact closed-form evaluation for the integral in Equation 12 does not exist, an
approximation was developed using the technique described with the development of Formula 7.
The resulting approximation formula for the error variance of individual items is

$$\sigma^2(e_i) = A + B\exp[-0.5(\delta_i/C)^2], \tag{13}$$

where $A = 0.011$, $B = 0.195$, and $C = 1.797$ if $|\delta_i| < 4$,
or $A = 0.0023$, $B = 0.171$, and $C = 2.023$ if $|\delta_i| \ge 4$.

As Formula 13 shows, $\sigma^2(e_i)$ is an even function of the item difficulty, i.e., the value of $\sigma^2(e_i)$ is the
same for $\delta_i$ and $-\delta_i$. Depending on the value of $\delta_i$, the absolute error of approximation with
Formula 13 ranges from 0 to 0.0008, with a mean of 0.0002 and a standard deviation of 0.0002.
Also, the errors vary in sign, thus canceling out to a large degree when the estimates of $\sigma^2(e_i)$ with
Formula 13 are summed to obtain the error variance for the number-right score, $\sigma_e^2$ (Equation 5).

Item Error Variance with Logistic Ability Distribution

This section provides exact formulas for $\sigma^2(e_i)$ with the fixed logistic distributions of $\theta$
used in this article ($c = 1$ and $c = 1/2$). The mathematical derivations (provided in Appendix A)
lead to the following exact evaluations of the expected item error variance, where $E_i = \exp(\delta_i)$:

1. With $c = 1$,

$$\sigma^2(e_i) = \frac{E_i(\delta_i E_i - 2E_i + \delta_i + 2)}{(E_i - 1)^3}. \tag{14}$$

For $\delta_i = 0$, one should use $\sigma^2(e_i) = 0.1667$ (the limit evaluation with $\delta_i \to 0$) to avoid "division by
zero" with Formula 14 (see Appendix A).

2. With $c = 1/2$,

$$\sigma^2(e_i) = \frac{E_i[8(1 - \delta_i)E_i^3 + 8(\delta_i + 1)E_i + \pi E_i^4 - 6\pi E_i^2 + \pi]}{2(E_i^2 + 1)^3}. \tag{15}$$

The sum of $\sigma^2(e_i)$ for the test items is the expected error variance for the number-right score, $\sigma_e^2$.
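
A minimal Python sketch of Formulas 13, 14, and 15 follows, together with a numerical evaluation of Equation 4 that can be used to check them. The helper `bell`, the Simpson rule, the truncation to [-40, 40], and the cutoff below which the limit 1/6 is substituted in Formula 14 are assumptions introduced for this example.

```python
import math

def error_var_normal(delta):
    """Formula 13: approximation for theta ~ N(0, 1), with two coefficient sets."""
    if abs(delta) < 4.0:
        a, b, c = 0.011, 0.195, 1.797
    else:
        a, b, c = 0.0023, 0.171, 2.023
    return a + b * math.exp(-0.5 * (delta / c) ** 2)

def error_var_logistic_c1(delta):
    """Formula 14 (c = 1); the limit 1/6 is used near delta = 0 (assumed cutoff)."""
    if abs(delta) < 1e-3:
        return 1.0 / 6.0
    e = math.exp(delta)
    return e * (delta * e - 2.0 * e + delta + 2.0) / (e - 1.0) ** 3

def error_var_logistic_c_half(delta):
    """Formula 15 (c = 1/2)."""
    e = math.exp(delta)
    num = e * (8.0 * (1.0 - delta) * e ** 3 + 8.0 * (delta + 1.0) * e
               + math.pi * e ** 4 - 6.0 * math.pi * e ** 2 + math.pi)
    return num / (2.0 * (e ** 2 + 1.0) ** 3)

def bell(u, scale=1.0):
    """exp(u/s) / (s [1 + exp(u/s)]^2) in an overflow-safe symmetric form.
    With s = 1 and u = theta - delta this is P(1 - P); with u = theta it is Equation 6."""
    z = math.exp(-abs(u) / scale)
    return z / (scale * (1.0 + z) ** 2)

def normal_pdf(theta):
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def simpson(f, a, b, n=8000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def error_var_quadrature(delta, pdf):
    """Equation 4 evaluated numerically; [-40, 40] is an assumed truncation."""
    return simpson(lambda t: bell(t - delta) * pdf(t), -40.0, 40.0)

for d in (-2.0, 0.0, 1.5):
    print(d,
          round(error_var_normal(d), 4), round(error_var_quadrature(d, normal_pdf), 4),
          round(error_var_logistic_c1(d), 4),
          round(error_var_quadrature(d, lambda t: bell(t, 1.0)), 4))
```

Summing the values returned by any one of the three functions over the items of a test gives $\sigma_e^2$ for the corresponding trait distribution (Equation 5).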

Item True Variance 

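Let $\sigma^2(\tau_i)$ be the variance of the true score on item $i$ at $\theta$, $P_i(\theta)$, as $\theta$ varies from $-\infty$ to $\infty$.
This item true variance relates to the item score, $\pi_i$, and item error variance, $\sigma^2(e_i)$, as follows:

$$\sigma^2(\tau_i) = \pi_i(1 - \pi_i) - \sigma^2(e_i). \tag{16}$$

Proof: Using the expectation rule $\mathrm{VAR}(X) = E(X^2) - [E(X)]^2$ with $X = P_i(\theta)$, we have

$$\begin{aligned}
\sigma^2(\tau_i) &= \int_{-\infty}^{\infty} [P_i(\theta)]^2\varphi(\theta)\,d\theta - \left(\int_{-\infty}^{\infty} P_i(\theta)\varphi(\theta)\,d\theta\right)^2 \\
&= \int_{-\infty}^{\infty} \{P_i(\theta) - P_i(\theta)[1 - P_i(\theta)]\}\varphi(\theta)\,d\theta - \pi_i^2 \\
&= \int_{-\infty}^{\infty} P_i(\theta)\varphi(\theta)\,d\theta - \int_{-\infty}^{\infty} P_i(\theta)[1 - P_i(\theta)]\varphi(\theta)\,d\theta - \pi_i^2 \\
&= \pi_i - \sigma^2(e_i) - \pi_i^2 = \pi_i(1 - \pi_i) - \sigma^2(e_i).
\end{aligned}$$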



Item Reliability 

Besides reliability coefficients at test level, indices of reliability at item level can also be 
useful in test development and analysis. Under TSM, the reliability of item $i$ is usually estimated
with the product $s_i r_{iX}$, where $s_i$ is the item-score standard deviation and $r_{iX}$ is the point-biserial
correlation between the item score and the total test score (e.g., Allen & Yen, 1979, p. 124). This
paper uses the definition "true item variance to observed item variance" for the reliability of individual
items, $\rho_{ii}$. Therefore, the reliability for Rasch calibrated items is evaluated here with

$$\rho_{ii} = \frac{\sigma^2(\tau_i)}{\sigma^2(\tau_i) + \sigma^2(e_i)}, \tag{17}$$

where $\sigma^2(\tau_i)$ is obtained with Formula 16 and $\sigma^2(e_i)$ with Formula 13, when $\theta \sim N(0, 1)$, or
Formulas 14 and 15 when $\theta$ follows the logistic distribution with $c = 1$ and $c = 1/2$, respectively.
Information about item reliability can be particularly useful when the purpose is to select items 
that maximize the internal consistency reliability (e.g., Allen & Yen, 1979, p. 125). 
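
Formulas 16 and 17 reduce to two lines of arithmetic once $\pi_i$ and $\sigma^2(e_i)$ are known. The short sketch below (function names are assumed for illustration) applies them to the values listed for item 11 in Table 1; flooring a negative true variance at zero mirrors the `IF (vt < 0) vt = 0` statement in the Appendix B syntax.

```python
def item_true_variance(pi_i, err_var):
    """Formula 16, floored at 0 as in the Appendix B syntax."""
    return max(pi_i * (1.0 - pi_i) - err_var, 0.0)

def item_reliability(pi_i, err_var):
    """Formula 17: true item variance over (true + error) item variance."""
    tv = item_true_variance(pi_i, err_var)
    return tv / (tv + err_var)

# Table 1, item 11 (difficulty 0): expected mean .5000, error variance .2060.
tv = item_true_variance(0.5, 0.2060)
rho = item_reliability(0.5, 0.2060)
print(round(tv, 4), round(rho, 4))  # prints: 0.044 0.176
```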

Formulas at Test Level 

Marginal NR score 

For a test of $n$ binary items, the marginal NR score, $\mu$, is provided with Equation 3, where
the additive components (item scores, $\pi_i$) are obtained through the use of Formula 7 (the normal
trait distribution), Formula 9 (the logistic trait distribution, $c = 1$), or Formula 11 (the logistic trait
distribution, $c = 1/2$).

Error variance for the NR Score

The (marginal) error variance for the NR score, $\sigma_e^2$, is provided with Equation 5, where the
additive components [item error variance, $\sigma^2(e_i)$] are obtained through the use of Formula 13
(the normal trait distribution), Formula 14 (the logistic trait distribution, $c = 1$), or Formula 15
(the logistic trait distribution, $c = 1/2$).

True Score Variance for the NR Score 

The true score variance for the NR score, $\sigma_\tau^2$, does not result from a direct summation of
true variances for individual items, $\sigma^2(\tau_i)$. As proven below, the theoretical value of $\sigma_\tau^2$ is

$$\sigma_\tau^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} \sqrt{[\pi_i(1 - \pi_i) - \sigma^2(e_i)][\pi_j(1 - \pi_j) - \sigma^2(e_j)]}. \tag{18}$$

Proof. With unidimensional tests (which are dealt with in this paper), there is a perfect
correlation between the congeneric true scores on two items, say $\tau_i$ and $\tau_j$, because of the linear
relationship $\tau_i = a_{ij} + b_{ij}\tau_j$, where $b_{ij} \ne 0, 1$ (e.g., Jöreskog, 1971). The covariance of $\tau_i$ and $\tau_j$,
then, is $\sigma(\tau_i, \tau_j) = \sigma(\tau_i)\sigma(\tau_j)$. Therefore, for the variance of the true number-right score on an $n$-item
test, $\tau$ ($= \sum \tau_i$), we have

$$\sigma_\tau^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} \sigma(\tau_i, \tau_j) = \sum_{i=1}^{n}\sum_{j=1}^{n} \sigma(\tau_i)\sigma(\tau_j). \tag{19}$$

Equation 19 leads directly to Formula 18 by replacing $\sigma(\tau_i)$ and $\sigma(\tau_j)$ with their expressions from
Equation 16. It should be noted also that Formulas 16 and 18 hold for any trait distribution since
their derivations remain the same with any $\varphi(\theta)$.
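
Since every term of the double sum in Formula 18 is a product of two item true-score standard deviations, the sum equals the square of the summed standard deviations. The sketch below (hypothetical function names) codes both forms and applies them to the item true variances listed for items 10 to 12 in Table 1.

```python
import math

def nr_true_variance(item_true_vars):
    """Formula 18 as a literal double sum over all item pairs (i, j)."""
    return sum(math.sqrt(vi * vj) for vi in item_true_vars for vj in item_true_vars)

def nr_true_variance_fast(item_true_vars):
    """Algebraically equivalent form: (sum of item true-score SDs) squared."""
    return sum(math.sqrt(v) for v in item_true_vars) ** 2

# Item true variances for items 10-12 of Table 1.
example = [0.0432, 0.0440, 0.0430]
print(round(nr_true_variance(example), 4), round(nr_true_variance_fast(example), 4))
print(round(sum(example), 4))  # the plain sum of item true variances is smaller
```

The second form needs $n$ square roots instead of $n^2$, which matters only for very long tests but makes the structure of Formula 18 easier to see.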

Reliability for Norm-Referenced Interpretations 

Under TSM, the reliability of measurement is defined as the ratio of true score variance to 
observed score variance 



$$\rho_{xx} = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \sigma_e^2}. \tag{20}$$

For internal consistency evaluations, $\rho_{xx}$ is typically estimated by Cronbach’s coefficient alpha
or by the KR-20 coefficient for dichotomously scored items (Kuder & Richardson, 1937).
However, even at population level, Cronbach’s alpha (or KR-20) is an accurate estimate of $\rho_{xx}$
only if there is no correlation among errors and the test components are at least essentially
tau-equivalent (Novick & Lewis, 1967). For Rasch calibrated items, one can determine $\rho_{xx}$ from
Equation 20 by replacing $\sigma_\tau^2$ and $\sigma_e^2$ with their population estimates using formulas developed in
the previous sections. This approach, unlike Cronbach’s alpha, does not require essential
tau-equivalency (the weaker assumption of congeneric measures is sufficient), thus eliminating factors
that may negatively affect the population estimate of $\rho_{xx}$. As a reminder, essentially tau-equivalent
items are assumed to have equal true-score variances, whereas congeneric measures may have
different scale origins and may vary in precision (Jöreskog, 1971). Previous research addresses
differences between some empirical estimates of $\rho_{xx}$ and the Rasch person separation reliability,
$R_R$ (e.g., Clauser, 1999; Linacre, 1996, 1997). Both $\rho_{xx}$ and $R_R$ represent the ratio of "true
variance to observed variance," but with $\rho_{xx}$ the variances are for raw scores, whereas with $R_R$ they
are for trait scores (logits). Linacre (1996) reports that the true-score reliability (KR-20 or
Cronbach’s alpha) is generally higher than $R_R$, whereas the statistical Rasch validity exceeds its
true-score counterpart. Also, the raw-score standard errors of extreme scores are close to zero, 
whereas extreme scores are usually excluded in Rasch analysis because their measure standard 
errors on the logit scale are infinite (e.g., Clauser, 1999). 

Reliability for Criterion-Referenced Interpretations 

Brennan and Kane (1977) introduced a dependability index, $\Phi(\lambda)$, for criterion-referenced
interpretations in the framework of generalizability theory (GT; e.g., Brennan, 1983):

$$\Phi(\lambda) = \frac{\sigma^2(p) + (\pi - \lambda)^2}{\sigma^2(p) + (\pi - \lambda)^2 + \sigma^2(\Delta)}, \tag{21}$$

where $\sigma^2(p)$ is the universe-score variance for persons, $\sigma^2(\Delta)$ is the absolute error variance, $\pi$ is
the domain score, and $\lambda$ is the cutting score (all scores are in proportion of items correct). In the
context of the GT design "person x items", $\sigma^2(\Delta) = \sigma^2(pi,e)/n + \sigma^2(i)/n$, where $n$ is the number of
items (e.g., Shavelson & Webb, 1991, p. 86). When $\lambda = \pi$, the index $\Phi(\lambda)$ reaches its lower limit,
referred to as index $\Phi$ in GT. Feldt and Brennan (1993) noted that "the index $\Phi(\lambda)$ characterizes
the dependability of decisions based on the testing procedure, whereas the index $\Phi$ characterizes
the contribution of the testing procedure to the dependability of such decisions" (p. 141).

Taking into account that $\sigma_\tau^2$ is the true variance for the person’s number-right score (see
Formula 18), whereas $\sigma^2(p)$ in Formula 21 is the true variance of the person’s proportion of items
correct, we have $\sigma^2(p) = \sigma_\tau^2/n^2$. On the other side, $\sigma^2(i) = \sigma^2(\pi_i)$ because they both represent the
variance of the expected item mean, $\pi_i$, across $n$ items. Also, taking into account that $\sigma_e^2$ is the
error variance for the number-right score, the absolute error variance can be represented with
$\sigma^2(\Delta) = \sigma_e^2/n^2 + \sigma^2(\pi_i)/n$. With this, Formula 21 translates into

$$\Phi(\lambda) = \frac{\sigma_\tau^2 + n^2(\pi - \lambda)^2}{\sigma_\tau^2 + n^2(\pi - \lambda)^2 + \sigma_e^2 + n\sigma^2(\pi_i)}. \tag{22}$$

When $\lambda = \pi$, $\Phi(\lambda)$ reaches its lowest limit (index $\Phi$):

$$\Phi = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \sigma_e^2 + n\sigma^2(\pi_i)}. \tag{23}$$

The comparison of Formulas 20 and 23 shows that $\Phi$ does not exceed $\rho_{xx}$. This is consistent with
the argument of Feldt and Brennan (1993) that "criterion-referenced interpretations of ‘absolute’
scores are more stringent than norm-referenced interpretations of ‘relative’ scores" (p. 141). It is
important to emphasize that the estimation of $\rho_{xx}$, $\Phi$, and $\Phi(\lambda)$ in GT requires information about
the raw scores for a sample of examinees, whereas the formulas developed in this article do not
require such information as long as the Rasch item calibration is available.
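
To make Formulas 20, 22, and 23 concrete, the sketch below evaluates them for the test-level values reported in this paper's 20-item example ($\sigma_\tau^2 = 10.4200$, $\sigma_e^2 = 3.1046$, $\mu = 10.1027$, $\sigma^2(\pi_i) = .071$). The function and argument names are assumptions made for illustration.

```python
def reliability(true_var, err_var):
    """Formula 20: norm-referenced reliability of the number-right score."""
    return true_var / (true_var + err_var)

def dependability_at_cut(true_var, err_var, item_mean_var, n, mu, cut):
    """Formula 22; cut is the cutting score as a proportion of items correct."""
    shift = (n * (mu / n - cut)) ** 2
    return (true_var + shift) / (true_var + shift + err_var + n * item_mean_var)

def dependability(true_var, err_var, item_mean_var, n):
    """Formula 23: lower limit of Formula 22, reached when cut equals mu / n."""
    return true_var / (true_var + err_var + n * item_mean_var)

# Test-level values reported in the paper's 20-item example.
n, mu = 20, 10.1027
true_var, err_var, item_mean_var = 10.4200, 3.1046, 0.071

print(round(reliability(true_var, err_var), 4))
print(round(dependability(true_var, err_var, item_mean_var, n), 4))
for cut in (0.3, 0.5, 0.7, 0.8):
    print(cut, round(dependability_at_cut(true_var, err_var, item_mean_var, n, mu, cut), 3))
```

Because the extra term $n\sigma^2(\pi_i)$ appears only in the denominator of Formula 23, the printed index $\Phi$ cannot exceed the printed $\rho_{xx}$.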



Example

This example illustrates the estimation of expected true-score measures and reliability (at
item and test level) using the formulas developed in this article for Rasch calibrated binary items.
The example is organized in two sections. The first section provides (in algorithmic order) the
expected measures and formulas used for their estimation with the normal trait distribution. The
execution of the formulas in this section is conducted through the use of the statistical package
SPSS (SPSS Inc., 1997). The SPSS syntax developed for this purpose is provided in Appendix B.
The second section of this example compares the expected true-score measures and reliability to
their empirical counterparts obtained with simulated data.

Theoretical Evaluation of True-Score Measures with Formulas

This section illustrates how researchers and practitioners may use the Rasch calibration of
binary items to evaluate expected true-score measures and reliability at both item and test level.
The Rasch difficulty parameters, $\delta_i$, for 20 hypothetical items are provided in Table 1 (the $\delta_i$ sum to
zero and cover uniformly the interval from -2.2 to 2.5 on the logit scale). The expected measures
and the formulas used for their evaluation with $\theta \sim N(0, 1)$ are listed below in algorithmic order.

1. Item mean, $\pi_i$ - Formula 7.
2. Item error variance, $\sigma^2(e_i)$ - Formula 13.
3. Item true variance, $\sigma^2(\tau_i)$ - Formula 16.
4. Item reliability, $\rho_{ii}$ - Formula 17.
5. Marginal number-right score, $\mu$ - Formula 3 (the domain score is $\pi = \mu/n$).
6. Error variance for the number-right score, $\sigma_e^2$ - Formula 5.
7. True score variance for the number-right score, $\sigma_\tau^2$ - Formula 18.
8. Descriptive variance of the item scores, $\sigma^2(\pi_i)$ - the variance of $\pi_i$ (see Step 1).
9. Reliability, $\rho_{xx}$ - Formula 20.
10. Dependability index $\Phi$ - Formula 23.
11. Dependability index $\Phi(\lambda)$ - Formula 22.

The SPSS printout (with the syntax in Appendix B and item parameters, $\delta_i$, in Table 1) provides
the expected true variance for the number-right score ($\sigma_\tau^2 = 10.4200$), the error variance for the
number-right score ($\sigma_e^2 = 3.1046$), the expected number-right score ($\mu = 10.1027$), and the
variance of expected item means, $\sigma^2(\pi_i) = .071$. Using these values, we obtain $\pi = \mu/n = .5051$,
$\rho_{xx} = .7704$ (with Formula 20), and $\Phi = .6972$ (with Formula 23). Also, using Formula 22, values
of the dependability index $\Phi(\lambda)$ are calculated and graphed for values of the cutting score, $\lambda$, that
vary from 0 to 1 on the domain scale with an increment of 0.005 (see Figure 2). The graphical
representation of $\Phi(\lambda)$ shows, for example, that its lowest value ($\Phi = .6972$) occurs when the
cutting score equals the population domain score ($\lambda = \pi = .5051$). Also, $\Phi(\lambda) = .85$ for $\lambda = .7$, and
$\Phi(\lambda)$ exceeds .90 when the cutting score is above .8 (i.e., 80% in percentages). This type of
information is very useful for criterion-based interpretations and decisions with mastery tests.
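
The steps listed above can also be sketched outside SPSS. The following Python version is an illustrative re-implementation (not the authors' syntax) that starts from the 20 item difficulties in Table 1 and applies Formulas 7, 13, 16, 18, 20, and 23 in order. Variable names are assumed, and the sample (n - 1) variance is used for $\sigma^2(\pi_i)$ on the assumption that this is what the DESCRIPTIVES command in Appendix B reports.

```python
import math
import statistics

# Rasch item difficulties from Table 1.
DELTAS = [-2.20, -2.00, -1.82, -1.53, -1.40, -1.25, -1.05, -0.85, -0.61, -0.25,
          0.00, 0.28, 0.45, 0.85, 1.21, 1.33, 1.97, 2.15, 2.22, 2.50]

def item_mean(d):
    """Formula 7 (normal trait distribution)."""
    return -0.0114 + 1.0228 / (1.0 + math.exp(d / 1.226))

def item_error_var(d):
    """Formula 13 (normal trait distribution)."""
    a, b, c = (0.011, 0.195, 1.797) if abs(d) < 4 else (0.0023, 0.171, 2.023)
    return a + b * math.exp(-0.5 * (d / c) ** 2)

n = len(DELTAS)
pis = [item_mean(d) for d in DELTAS]                           # Step 1
errs = [item_error_var(d) for d in DELTAS]                     # Step 2
trues = [max(p * (1 - p) - e, 0.0) for p, e in zip(pis, errs)] # Step 3, Formula 16
rels = [t / (t + e) for t, e in zip(trues, errs)]              # Step 4, Formula 17

mu = sum(pis)                                       # Step 5, Equation 3
var_e = sum(errs)                                   # Step 6, Equation 5
var_tau = sum(math.sqrt(t) for t in trues) ** 2     # Step 7, Formula 18
var_pi = statistics.variance(pis)                   # Step 8, sample variance (assumed)
rho_xx = var_tau / (var_tau + var_e)                # Step 9, Formula 20
phi = var_tau / (var_tau + var_e + n * var_pi)      # Step 10, Formula 23

print(f"mu={mu:.4f} var_e={var_e:.4f} var_tau={var_tau:.4f} var_pi={var_pi:.4f}")
print(f"rho_xx={rho_xx:.4f} phi={phi:.4f}")
```

Swapping `item_mean` and `item_error_var` for the logistic-distribution formulas would give the corresponding results for $c = 1$ or $c = 1/2$ without changing the rest of the pipeline.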

The SPSS syntax (see Appendix B) also provides the expected true-score measures and
reliability for individual items. They appear as "new" variables in the SPSS data spreadsheet, with
notations that should be interpreted as follows: var_err = $\sigma^2(e_i)$, p = $\pi_i$, var_tau = $\sigma^2(\tau_i)$, and roi =
$\rho_{ii}$ (the values of these output variables are provided in Table 1).

Empirical Validation of the Formulas with Simulated Data

The expected measures obtained in the previous section are compared now with their
empirical counterparts obtained with simulated data. Specifically, binary scores were generated to
fit the Rasch model [with the item parameters, $\delta_i$, in Table 1 and $\theta \sim N(0, 1)$] using a computer
program written in SAS (SAS Institute, 1985) for Monte Carlo simulations (Dimitrov, 1996). The
(ANOVA-based) generalizability model "person x item" (p x i) incorporated in this program was
run with 20 replications generating binary scores for 1,500 persons in each replication. The
resulting empirical estimates of true-score measures and reliability (at test level) are summarized
in Table 2. The comparison of these empirical estimates with their theoretical counterparts (also
presented in Table 2) shows a close match. The same holds for the comparison of the expected
item means, $\pi_i$, with their empirical counterparts ($p_i$) obtained for the SAS simulated binary
scores (see Table 1). Thus, with Rasch calibrated items, the formulas developed in this article
provide (without data) estimates of true-score measures and reliability that one can obtain (with
"ideal" data simulated for large samples) using the "person x item" GT model. In addition, the
formulas provide the marginal values of true-score measures and reliability for individual items
[$\sigma^2(\tau_i)$, $\sigma^2(e_i)$, and $\rho_{ii}$] that are not provided with the GT model.

The Rasch person separation reliability index, $R_R$, was also calculated for the generating
measures and item difficulties with the SAS simulations. Linacre (1997) refers to $R_R$ obtained with
generated $\theta$-measures as generator-based Rasch reliability and shows that it is an upper limit for
data-based $R_R$. The generator-based reliability with the SAS simulations in this example was
found to be $R_R = .673$. The fact that the theoretical $\rho_{xx}$ (.770) is higher than $R_R$ (.673) in this
example is not a surprise given that even empirical estimates of $\rho_{xx}$ (KR-20 or Cronbach’s alpha)
generally exceed $R_R$ (Linacre, 1996).
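
A rough idea of this kind of validation can be obtained with a small simulation. The sketch below is not the SAS program or the GT analysis used in the study: it generates a single replication of 1,500 persons in pure Python, uses KR-20 as a stand-in for the empirical reliability estimate, and relies on an arbitrary random seed; all names are assumptions of this example.

```python
import math
import random

# Rasch item difficulties from Table 1.
DELTAS = [-2.20, -2.00, -1.82, -1.53, -1.40, -1.25, -1.05, -0.85, -0.61, -0.25,
          0.00, 0.28, 0.45, 0.85, 1.21, 1.33, 1.97, 2.15, 2.22, 2.50]

def simulate(n_persons, deltas, rng):
    """Binary responses that fit the Rasch model with theta ~ N(0, 1); one row per person."""
    rows = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        rows.append([1 if rng.random() < 1.0 / (1.0 + math.exp(d - theta)) else 0
                     for d in deltas])
    return rows

def kr20(rows):
    """KR-20 internal consistency coefficient for a persons-by-items 0/1 matrix."""
    n, k = len(rows), len(rows[0])
    totals = [sum(r) for r in rows]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    pq = 0.0
    for j in range(k):
        p = sum(r[j] for r in rows) / n
        pq += p * (1.0 - p)
    return k / (k - 1) * (1.0 - pq / var_total)

rng = random.Random(2002)  # arbitrary seed, for reproducibility only
data = simulate(1500, DELTAS, rng)
domain_score = sum(map(sum, data)) / (len(data) * len(DELTAS))
item_means = [sum(r[j] for r in data) / len(data) for j in range(len(DELTAS))]

print("empirical domain score:", round(domain_score, 4))
print("KR-20:", round(kr20(data), 4))
print("empirical item means:", [round(p, 3) for p in item_means[:5]], "...")
```

With samples of this size, the empirical domain score and item means should fall close to the theoretical values in Tables 1 and 2, although individual replications will differ by sampling error.
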
Table 1

True-Score Measures and Reliability for Individual Items
Evaluated as a Function of Their Rasch Difficulty, $\delta_i$.

Item   $\delta_i$   $\sigma^2(e_i)$   $\pi_i$ ($p_i$)^a   $\sigma^2(\tau_i)$   $\rho_{ii}$
 1     -2.2000      .1032             .8656 (.8620)       .0132                .1131
 2     -2.0000      .1160             .8440 (.8446)       .0157                .1191
 3     -1.8200      .1278             .8224 (.8133)       .0183                .1251
 4     -1.5300      .1467             .7833 (.7920)       .0231                .1358
 5     -1.4000      .1550             .7639 (.7560)       .0254                .1408
 6     -1.2500      .1641             .7402 (.7287)       .0282                .1466
 7     -1.0500      .1754             .7065 (.6787)       .0320                .1541
 8      -.8500      .1854             .6705 (.6827)       .0356                .1610
 9      -.6100      .1951             .6247 (.6054)       .0394                .1679
10      -.2500      .2041             .5520 (.5600)       .0432                .1746
11       .0000      .2060             .5000 (.4874)       .0440                .1760
12       .2800      .2036             .4419 (.4463)       .0430                .1742
13       .4500      .2000             .4072 (.3860)       .0414                .1715
14       .8500      .1854             .3295 (.3400)       .0356                .1610
15      1.2100      .1664             .2663 (.2514)       .0289                .1481
16      1.3300      .1593             .2470 (.2353)       .0267                .1435
17      1.9700      .1179             .1594 (.1633)       .0161                .1201
18      2.1500      .1063             .1395 (.1300)       .0138                .1145
19      2.2200      .1019             .1323 (.1180)       .0129                .1125
20      2.5000      .0851             .1064 (.1093)       .0100                .1049

Note. $\sigma^2(e_i)$ is the expected error variance, $\pi_i$ - expected mean,
($p_i$ - empirical mean), $\sigma^2(\tau_i)$ - expected true variance, and $\rho_{ii}$
- expected reliability for individual items.
a Obtained for the SAS simulated binary scores.
Table 2

Theoretical True-Score Measures and Reliability (Evaluated with Formulas)
and Their Empirical Counterparts Evaluated with Simulated Data for the
Rasch Item Difficulties ($\delta_i$) in Table 1 and the Normal Trait Distribution.

Evaluation    $\pi$    $\sigma_\tau^2$   $\sigma_e^2$   $\sigma^2(\pi_i)$   $\rho_{xx}$   $\Phi$
Theoretical   .5051    10.4200           3.1046         .0710               .7704         .6972
Empirical     .5067     9.9572           3.1647         .0708               .7548         .6802

Note. The empirical estimates are obtained through averaging their values
over 20 replications of SAS simulations for binary scores that fit the Rasch
model with 1,500 persons per replication.

Figure 2. The dependability index, $\Phi(\lambda)$, estimated with
Formula 21 for the theoretical true-score measures in Table 2
with the illustrative example.
Conclusion

This paper provides formulas for true-score measures and reliability of binary scores as a 
function of the Rasch item difficulty for fixed distributions (normal or logistic) of the underlying 
trait. The scale parameters c = 1 and c = 1/2 were selected for the two fixed logistic distributions 
because they yield exact integral evaluations and produce normal-like distributions of the
underlying trait that may occur with Rasch measurements (this is not true with just any scale
parameter of the logistic distribution). Formulas 7 and 13 for $\pi_i$ and $\sigma^2(e_i)$, respectively, with the
normal trait distribution are developed by the use of approximation procedures, whereas all other 
formulas result from exact integral evaluations. The example in the previous section illustrates an 
application of the formulas for Rasch calibrated items. The calculations are easy to perform using 
statistical programs such as SAS and SPSS (see Appendix B), spreadsheet-based programs, or 
even hand calculators. The formulas can also be efficiently incorporated into computer programs 
for test analysis and measurement simulations. 

The formulas developed in this paper have theoretical and practical value for Rasch test 
development, score analysis, and simulation studies. Their closed analytical forms may reveal 
relationships that are difficult or impossible to see with empirical tools (e.g., Formula 13 shows
that the item error variance has the same value for opposite Rasch item difficulties, $\delta_i$ and $-\delta_i$).
Also, given a bank of Rasch calibrated items, one can select items to develop a test with known 
true-score measures and reliability for a person population prior to administering the test. One can 
also compare (without using raw scores or trait measures) the expected domain scores and 
reliability for test strands in which items are grouped by substantive characteristics (e.g., content 
areas or learning outcomes). In another scenario, the formulas can be used to evaluate (prior to
administration) test booklets that are developed for follow-up measurements (e.g., in longitudinal 
studies) given the Rasch calibration of items at the base year. In simulation studies, researchers 
may use the formulas to generate true-score characteristics and reliability for targeted values of 
Rasch item difficulty without the necessity of generating binary scores or $\theta$-scores for persons.

The examples of possible applications of the formulas developed in this paper illustrate 
what researchers and practitioners can gain over and above what they would learn from the Rasch
analysis. It is important to emphasize that the proposed formulas and the Rasch analysis provide 
different types of information that can efficiently complement (not replace or exclude) each other 
in test development and analysis. For example, while the Rasch analysis is effective at locating 
persons on the underlying trait (Linacre, 1996), the formulas developed in this article are effective 
at determining population true-score characteristics for Rasch calibrated items without using raw 
scores or trait measures for examinees. Also, while the Rasch measures of reliability ($R_R$) and
"separation" provide information about measurably different levels of performance in a sample of
examinees (e.g., Wright, 1996, 1998), the index $\Phi(\lambda)$ provides information about the
dependability of criterion-referenced decisions. Which approach to use (Rasch analysis, true-score 
analysis with the proposed formulas, or both) depends on the goals of the study as well as on the 
data that is available (raw scores, trait scores, or only estimates of Rasch item difficulty). 

One can also argue that estimates of true-score measures and reliability can be obtained 
within the framework of generalizability theory using, for example, computer programs such as 
GENOVA (Crick & Brennan, 1983). This approach, however, (a) requires the binary scores for a
large sample of examinees and (b) does not provide true-score measures at item level such as
$\sigma^2(e_i)$, $\sigma^2(\tau_i)$, and $\rho_{ii}$. Therefore, for Rasch calibrated items, the formulas developed in this paper
provide (without data) richer, more accurate, and easily obtained information about true-score 
measures and reliability at population level relative to (ANOVA-based) generalizability methods. 

Skewed trait distributions also occur with Rasch measurement (e.g., in medical studies; 
Wright, 2001). Dimitrov (2001) provided formulas for the expected error variance with some 
skewed trait distributions. Formulas 16 and 18 for the true score variance can also be used with 
skewed distributions because their derivation holds with any $\varphi(\theta)$. In conclusion, using Rasch
calibration of items to evaluate their expected true-score measures, reliability, and dependability 
extends the traditional boundaries in calculating, interpreting, and reporting measurement results. 




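Appendix A

Derivation of Formulas 14 and 15 for Item Error Variance with Logistic Trait Distribution

For $P_i(\theta)$ with the dichotomous Rasch model (Equation 1), we have

$$P_i(\theta)[1 - P_i(\theta)] = \frac{\exp(\theta - \delta_i)}{[1 + \exp(\theta - \delta_i)]^2}, \tag{A1}$$

which (as one can easily see) is also the first derivative of $P_i(\theta)$. With this, Equation 4 becomes

$$\sigma^2(e_i) = \int_{-\infty}^{\infty} [\partial P_i(\theta)/\partial\theta]\,\varphi(\theta)\,d\theta = \int_{-\infty}^{\infty} \varphi(\theta)\,dP_i(\theta). \tag{A2}$$

As one may also notice, the logistic $\varphi(\theta)$ in Equation 6 is the first derivative of the function

$$\Phi(\theta) = \frac{\exp(\theta/c)}{1 + \exp(\theta/c)}$$

(here $\Phi(\theta)$ denotes the logistic distribution function and is not to be confused with the dependability index).
Replacing $\varphi(\theta)$ in Equation A2 with the first derivative of $\Phi(\theta)$, we have

$$\sigma^2(e_i) = \int_{-\infty}^{\infty} [\partial P_i(\theta)/\partial\theta][\partial\Phi(\theta)/\partial\theta]\,d\theta = \int_{-\infty}^{\infty} [\partial P_i(\theta)/\partial\theta]\,d\Phi(\theta). \tag{A3}$$

With integration by parts for the integral in Equation A3, we have

$$\begin{aligned}
\sigma^2(e_i) &= [\partial P_i(\theta)/\partial\theta]\,\Phi(\theta)\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} \Phi(\theta)\,[\partial^2 P_i(\theta)/\partial\theta^2]\,d\theta \\
&= 0 - \int_{-\infty}^{\infty} \frac{\exp(\theta/c)}{1 + \exp(\theta/c)} \cdot \frac{\exp(\theta - \delta_i)[1 - \exp(\theta - \delta_i)]}{[1 + \exp(\theta - \delta_i)]^3}\,d\theta.
\end{aligned}$$

Let $E_i = \exp(\delta_i)$. Using the substitution rule for integration with $x = \exp(\theta)$, we obtain

$$\sigma^2(e_i) = E_i \int_{0}^{\infty} \frac{x^{1/c}(x - E_i)}{(1 + x^{1/c})(x + E_i)^3}\,dx. \tag{A4}$$

The evaluation of the integral in Equation A4 for $c = 1$ or $c = 1/2$ is straightforward and yields the following.

1. With $c = 1$,

$$\sigma^2(e_i) = \frac{E_i(\delta_i E_i - 2E_i + \delta_i + 2)}{(E_i - 1)^3}. \tag{A5}$$

When $\delta_i = 0$, the denominator of the ratio in Formula A5 equals zero. For this particular
case, evaluating the limit of the ratio at $\delta_i = 0$, we obtain $\sigma^2(e_i) = 0.1667$.

2. With $c = 1/2$,

$$\sigma^2(e_i) = \frac{E_i[8(1 - \delta_i)E_i^3 + 8(\delta_i + 1)E_i + \pi E_i^4 - 6\pi E_i^2 + \pi]}{2(E_i^2 + 1)^3}, \tag{A6}$$

where $\pi$ is a constant ($\pi = 3.14159...$ is not to be confused with the domain score) and $E_i$ denotes
$\exp(\delta_i)$ for simplicity of the analytical form. As one may notice, Formulas A5 and A6 are exactly
Formulas 14 and 15, respectively, with which the derivation is completed.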
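
The limit quoted for Formula A5 at $\delta_i = 0$ can be verified with a short Taylor expansion. The steps below are a supplementary check written for this purpose (they are not part of the original derivation) and use only $E_i = \exp(\delta_i) = 1 + \delta_i + \delta_i^2/2 + \delta_i^3/6 + O(\delta_i^4)$:

$$\begin{aligned}
\delta_i E_i - 2E_i + \delta_i + 2
&= \left(\delta_i + \delta_i^2 + \tfrac{1}{2}\delta_i^3\right) - \left(2 + 2\delta_i + \delta_i^2 + \tfrac{1}{3}\delta_i^3\right) + \delta_i + 2 + O(\delta_i^4) \\
&= \tfrac{1}{6}\delta_i^3 + O(\delta_i^4), \\
(E_i - 1)^3 &= \delta_i^3 + O(\delta_i^4), \\
\lim_{\delta_i \to 0} \sigma^2(e_i) &= \lim_{\delta_i \to 0} \frac{E_i\left[\tfrac{1}{6}\delta_i^3 + O(\delta_i^4)\right]}{\delta_i^3 + O(\delta_i^4)} = \frac{1}{6} \approx 0.1667.
\end{aligned}$$

The same value follows directly from Equation A4 with $c = 1$ and $E_i = 1$, since $\int_0^\infty x(x - 1)/(1 + x)^4\,dx = 1/6$.
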
Appendix B

SPSS Syntax for Evaluation of True-Score Measures of Rasch Calibrated Binary Items
with the Normal Trait Distribution (Input variable: b, the Rasch item difficulty)

* Item error variance (Formula 13).
DO IF (ABS(b) < 4).
COMPUTE ve = .011 + .195*exp(-.5*((b/1.797)**2)).
ELSE.
COMPUTE ve = .0023 + .171*exp(-.5*((b/2.023)**2)).
END IF.
* Item mean (Formula 7) and item true variance (Formula 16).
COMPUTE p = -.0114 + 1.0228/(1 + exp(b/1.226)).
COMPUTE vt = p*(1 - p) - ve.
IF (vt < 0) vt = 0.
SET FORMAT = F8.4 ERRORS = NONE RESULTS OFF HEADER NO.
* Transpose so that items become columns, then accumulate the double sum of Formula 18 in Y.
FLIP
 VARIABLES b ve p vt.
VECTOR V = VAR001 TO VAR020.
COMPUTE Y = 0.
LOOP #I = 1 TO 20.
LOOP #J = 1 TO 20.
COMPUTE Y = Y + SQRT(V(#I)*V(#J)).
END LOOP.
END LOOP.
FLIP VAR001 TO VAR020 Y.
* Item reliability (Formula 17).
COMPUTE roi = vt/(vt + ve).
SET RESULTS ON.
REPORT FORMAT = AUTOMATIC
 /VARIABLES = ve ' ' p ' ' vt ' '
 /BREAK = (TOTAL)
 /SUMMARY = MAX(vt) 'True score variance:'
 /SUMMARY = SUBTRACT(SUM(ve) MAX(ve)) (vt (COMMA) (4)) 'Error variance:'
 /SUMMARY = SUBTRACT(SUM(p) MAX(p)) (vt (COMMA) (4)) 'Expected mean:' .
SELECT IF (CASE_LBL ~= 'Y').
RENAME VARIABLES (CASE_LBL = ITEM) (ve = var_err) (vt = var_tau).
VARIABLE LABELS p 'Expected item mean' .
DESCRIPTIVES
 VARIABLES = p
 /STATISTICS = VAR .

Note. The number of items (in this example, 20) should be specified in the syntax by the user.
With 50 items, for example, change 20 to 50 and VAR020 to VAR050 in the four syntax lines
where they appear (the VECTOR line, the two LOOP lines, and the second FLIP line).